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Abstract 

This paper addresses the problem of scal¬ 
able optimization for l\ -regularized conditional 
Gaussian graphical models. Conditional Gaus¬ 
sian graphical models generalize the well-known 
Gaussian graphical models to conditional distri¬ 
butions to model the output network influenced 
by conditioning input variables. While highly 
scalable optimization methods exist for sparse 
Gaussian graphical model estimation, state-of- 
the-art methods for conditional Gaussian graph¬ 
ical models are not efficient enough and more 
importantly, fail due to memory constraints for 
very large problems. In this paper, we propose 
a new optimization procedure based on a New¬ 
ton method that efficiently iterates over two 
sub-problems, leading to drastic improvement 
in computation time compared to the previous 
methods. We then extend our method to scale 
to large problems under memory constraints, us¬ 
ing block coordinate descent to limit memory us¬ 
age while achieving fast convergence. Using syn¬ 
thetic and genomic data, we show that our meth¬ 
ods can solve problems with millions of variables 
and tens of billions of parameters to high accu¬ 
racy on a single machine. 

1 INTRODUCTION 

Sparse Gaussian graphical models (GGMs) 0 have been 
extremely popular as a tool for learning a network structure 
over a large number of continuous variables in many dif¬ 
ferent application domains including neuroscience 0 and 
biology 0.A sparse GGM can be estimated as a sparse in¬ 
verse covariance matrix by minimizing the convex function 
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of li -regularized negative log-likelihood. Highly scalable 
learning algorithms such as graphical lasso J2, QUIC 0, 
and BigQUIC £Q have been proposed to learn the model. 

In this paper, we address the problem of scaling up the op¬ 
timization of sparse conditional GGM (CGGM), a model 
closely related to sparse GGM, to very large problem sizes 
without requiring excessive time or memory. Sparse CG- 
GMs have been introduced as a discriminative extension of 
sparse GGMs to model a sparse network over outputs con¬ 
ditional on input variables El ED- CGGMs can be viewed 
as a Gaussian analogue of conditional random field (6), 
while GGMs are a Gaussian analogue of Markov random 
field. Sparse CGGMs have recently been applied to various 
settings including energy forecasting mo , finance m, and 
biology ED, where the goal is to predict structured outputs 
influenced by inputs. A sparse CGGM can be estimated 
by minimizing a convex function of ^-regularized negative 
log-likelihood. This optimization problem is closely related 
to that for sparse GGMs because CGGMs also model the 
network over outputs. However, the presence of the addi¬ 
tional parameters in CGGMs for the functional mapping 
from inputs to outputs makes the optimization significantly 
more complex than in sparse GGMs. 

Several different approaches have been previously pro¬ 
posed to estimate sparse CGGMs, including OWL-QN 0, 
accelerated proximal gradient method ED, and Newton 
coordinate descent algorithm ifTOl . In particular, the New¬ 
ton coordinate descent algorithm extends the QUIC algo¬ 
rithm 0 for sparse GGM estimation to the case of CG¬ 
GMs, and has been shown to have superior computational 
speed and convergence. This approach finds in each iter¬ 
ation a descent direction by minimizing a quadratic ap¬ 
proximation of the original negative log-likelihood func¬ 
tion along with l\ regularization. Then, the parameter esti¬ 
mate is updated with this descent direction and a step size 
found by line search. 

Although the Newton coordinate descent method ED is 
state-of-the-art for its scalability and fast convergence, it is 
still not efficient enough to be applied to many real-world 
problems even with tens of thousands of variables. More 
importantly, it suffers from a large space requirement, be- 
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cause for very high-dimensional problems, several large 
dense matrices need to be precomputed and stored during 
optimization. For a CGGM with p inputs and q outputs, the 
algorithm requires storing several p x p and q x q dense 
matrices, which cannot fit in memory for large p and q. 

We propose new algorithms for learning l \-regularized CG¬ 
GMs that significantly improve the computation time of the 
previous Newton coordinate descent algorithm and also re¬ 
move the large memory requirement. We first propose an 
optimization method, called an alternating Newton coordi¬ 
nate descent algorithm, for improving computation time. 
Our algorithm is based on the key observation that the 
computation simplifies drastically, if we alternately opti¬ 
mize the two sets of parameters for output network and for 
mapping inputs to outputs, instead of updating all parame¬ 
ters at once as in the previous approach. The previous ap¬ 
proach updated all parameters simultaneously by forming 
a second-order approximation of the objective on all pa¬ 
rameters, which requires an expensive computation of the 
large Hessian matrix of size (p + q) x (p + q) in each 
iteration. Our approach of alternate optimization forms a 
second-order approximation only on the network parame¬ 
ters, which requires the Hessian of size q x q, as the other 
set of parameters can be updated easily using a simple co¬ 
ordinate descent. 

In order to overcome the constraint on the space require¬ 
ment, we then extend our algorithm to an alternating New¬ 
ton block coordinate descent method that can be applied 
to problems of unbounded size on a machine with limited 
memory. Instead of recomputing each element of the large 
matrices on demand, we divide the parameters into blocks 
for block-wise updates such that the results of computa¬ 
tion can be reused within each block. Block-wise parame¬ 
ter updates were previously used in BigQUIC J4| for learn¬ 
ing a sparse GGM, where the block sparsity pattern of the 
network parameters was leveraged to overcome the space 
limitations. We propose an approach for block-wise update 
of the output network parameters in CGGMs that extends 
their idea. We then propose a new block-wise update strat¬ 
egy for the parameters for mapping inputs to outputs. In 
our experiments, we show that we can solve problems with 
a million inputs and hundreds of thousands of outputs on a 
single machine. 

The rest of the paper is organized as follows. In Section [2] 
we provide a brief review of sparse CGGMs and the current 
state-of-the-art Newton coordinate descent algorithm in 
for learning the models. In Section [3] we propose an al¬ 
ternating Newton coordinate descent algorithm that signifi¬ 
cantly reduces computation time compared to the previous 
method. In Section [4] we further extend our algorithm to 
perform block-wise updates in order to scale up to very 
large problems on a machine with bounded memory. In 
Section[5j we demonstrate our proposed algorithms on syn¬ 
thetic and real-world genomic data. 


2 BACKGROUND 

2.1 The Conditional Gaussian Graphical Model 

A CGGM (8] [TO] models the conditional probability den¬ 
sity ofx <E M.P given y £ M 9 as follows: 

p(y|x; A, 0) = exp{—y T Ay - 2x T 0y}/Z(x), 

where A is a q x q matrix for modeling the network 
over y and 0 is a p x q matrix for modeling the 
mapping between the input variables x and output vari¬ 
ables y. The normalization constant is given as Z(x) = 
c| A| exp(x T 0A~ 1 0 T x). Inference in a CGGM gives 
p(y|x) = A/”(Bx, A^ 1 ), where B = —©A” 1 , showing 
the connection between a CGGM and multivariate linear 
regression. 

Given a dataset of X £ R nxp and Y £ R nX9 for n sam¬ 
ples, and their covariance matrices S xx = -X T X, S xy = 
^X t Y, S yy = ^Y t Y, a sparse estimate of CGGM pa¬ 
rameters can be obtained by minimizing li -regularized neg¬ 
ative log-likelihood: 

mmf(A,&) = g(A,&) + h(A,@), (1) 

AxO,® 

where g( A, 0) = —log |A| + tr(S yy A + 2S xy T 0 + 
A ' 1 0 7 S xx 0) for the smooth negative log-likelihood and 
h( A, 0)=Aa||A||i + A©||0||i for the non-smooth elemen¬ 
twise l\ penalty. Aa, A© > 0 are regularization parameters. 
As observed in previous work mmm. this objective is 
convex. 

2.2 Optimization 

The current state-of-the-art method for solving Eq. ([T} for 
Zi-regularized CGGM is the Newton coordinate descent 
algorithm ifTOl that extends QUIC [0 for l \-regularized 
GGM estimation. In each iteration, this algorithm found a 
generalized Newton descent direction by forming a second- 
order approximation of the smooth part of the objective and 
minimizing this along with the l \ penalty. Given this New¬ 
ton direction, the parameter estimates were updated with a 
step size found by line search using Armijo’s rule 01- 

In each iteration, the Newton coordinate descent algorithm 
found the Newton direction as follows: 

D a ,D© = argmin5A,®(A A , A©) 

Aa,A@ 

+ h(A + Aa, 0 + A©), (2) 

where <?a,® is the second-order approximation of g given 
by Taylor expansion: 

5a,®(A a , A©) = vec(Vg(A, 0)) T vec([A A A©]) 

+ ^vec([A A A©]) T V 2 g(A, 0) vec([A A A©]). 
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The gradient and Hessian matrices above are given as: 


V«?( A,0) = 


V 2 «?(A,0) = 


[Va<?(A,0) 

[Syy-S-Vt 

Vis( A,©) 
V a Vq(?(A, 0) t 

'S® (£ + 2^) 
— 2 S <g> r 


V©<?(A,0)] 


2S xy + 2 r] (3) 

VaV©3(A, 0) 

V 2 0 5(A,©) _ 


- 2 £ ( 8 ) r T ' 
2£ (8 S xx 


, (4) 


where £ = A -1 , = £© 7 S xx 0S, and T = S xx ©£. 

Given the Newton direction in Eq. ([2]), the parameters can 
be updated as A •<— A + aD A and ©•(—© + aD©, where 
step size 0 < a < 1 ensures sufficient decrease in Eq. 0 
and positive definiteness of A. 

The Lasso problem j9) in Eq. 0 was solved using coordi¬ 
nate descent. Despite the efficiency of coordinate descent 
for Lasso, applying coordinate updates repeatedly to all 
q 2 + pq variables in A and © is costly. So, the updates 
were restricted to an active set of variables given as: 


Sa = {(Aa)jj : |(Vas(A, 0))ij| > Aa V A^ ^ 0} 

S® = {(A ®)ij : |(V©<?(A, ©))y| > A© V ©^ ^ 0}. 

Because the active set sizes ?ba = 5a |, m© = |5©| ap¬ 
proach the number of non-zero entries in the sparse solu¬ 
tions for A* and 0* over iterations, this strategy yields a 
substantial speedup. 

To further improve the efficiency of coordinate descent, in¬ 
termediate results were stored for the large matrix products 
that need to be computed repeatedly. At the beginning of 
the optimization for Eq. 0, U := A a £ and V := A©£ 
were computed and stored. Then, after a coordinate descent 
update to (Aa)*?-, row i and j of U were updated. Simi¬ 
larly, after an update to (A©)y, row / of V was updated. 


2.3 Computational Complexity and Scalability 

Although the Newton coordinate descent method is com¬ 
putationally more efficient than other previous approaches, 
it does not scale to problems even with tens of thousands 
of variables. The main computational cost of the algorithm 
comes from computing the large (p + q) x (p + q) Hes¬ 
sian matrix in Eq. 0 in each application of Eq. 0 to 
find the Newton direction. At the beginning of the opti¬ 
mization in Eq. 0, large dense matrices £, M/, and T, for 
computing the gradient and Hessian in Eqs. 0 and 0. 
are precomputed and reused throughout the coordinate de¬ 
scent iterations. Initializing £ = A -1 via Cholesky de¬ 
composition costs up to 0{q 3 ) time, although in practice, 
sparse Cholesky decomposition exploits sparsity to invert 
A in much less than 0{q 3 ). Computing = R r R, where 
R = X©£, requires 0(nm®+nq 2 ) time, and computing 
r costs 0(npq + nq 2 ). After the initializations, the cost of 
coordinate descent update per each active variable (Aa )ij 
and (A©)jj is ()(p + q). During the coordinate descent for 


solving Eq. 0, the entire (p + q) x (p + q) Hessian matrix 
in Eq. 0 needs to be evaluated, whereas for the gradient in 
Eq. 0 only those entries corresponding to the parameters 
in active sets are evaluated. 

A more serious obstacle to scaling up to problems with 
large p and q is the space required to store dense matri¬ 
ces £ (size qxq), (size q x q), and T (size p x q). In our 
experiments on a machine with 104 Gb RAM, the Newton 
coordinate descent method exhausted memory when p + q 
exceeded 80,000. 

In the next section, we propose a modification of the 
Newton coordinate descent algorithm that significantly im¬ 
proves the computation time. Then, we introduce block- 
wise update strategies to our algorithm to remove the mem¬ 
ory constraint and to scale to arbitrarily large problem 
sizes. 

3 ALTERNATING NEWTON 
COORDINATE DESCENT 

In this section, we introduce our alternating Newton co¬ 
ordinate descent algorithm for learning an l \-regularized 
CGGM that significantly reduces computation time com¬ 
pared to the previous method. Instead of performing New¬ 
ton descent for all parameters A and 0 simultaneously, our 
approach alternately updates A and ©, optimizing Eq. 0 
over A given © and vice versa until convergence. 

Our approach is based on the key observation that with A 
fixed, the problem of solving Eq. 0 over © becomes sim¬ 
ply minimizing a quadratic function with li regularization. 
Thus, it can be solved efficiently using a coordinate descent 
method, without the need to form a second-order approx¬ 
imation or to perform line search. On the other hand, op¬ 
timizing Eq. 0 for A given © still requires forming a 
quadratic approximation to find a generalized Newton di¬ 
rection and performing line search to find the step size. 
However, this computation involves only qxq Hessian ma¬ 
trix and is significantly simpler than performing the same 
type of computation on both A and © jointly as in the pre¬ 
vious approach. 

3.1 Coordinate Descent Optimization for A 

Given fixed ©, the problem of minimizing the objective in 
Eq.0 with respect to A becomes 

argming©(A) + A a ||A||i, (5) 

A^O 

where //©(A) = — log |A| + tr(S yy A + A“ 1 © T S xx 0). 
In order to solve this, we first find a generalized Newton di¬ 
rection that minimizes the l\ -regularized quadratic approx¬ 
imation of g® (A): 

D a = argmin g\,@ ( A a ) + A a || A + A a ||i, (6) 

Aa 
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Algorithm 1: Alternating Newton Coordinate Descent 
input : Inputs X G R nxp and Y e M"* 9 ; regularization 
parameters Aa, A© 
output: Parameters A, © 

Initialize 0 •<— 0, A <— I q 
for t = 0,1,... do 

Determine active sets S \, S@ 

Solve via coordinate descent: 

D a = argmin g At @{A + A a ,0) + /i(A+A a ,0) 

^ A ’^SX=0 

Update A + = A + aD A , where step size a is found 
with line search 
Solve via coordinate descent: 

©+ = argmin© Se g A {&) + A©||©||i 


where g A ,&( A a ) is obtained from a second-order Taylor 
expansion and is given as 

5 a,©(A a ) = vec(V A g(A, 0)) T vec(A A ) 

+ ^ vec(A A ) T V A p(A, 0) vec(A A ). 

The V A p(A, 0) and X\g(A. 0) above are components 
of the gradient and Hessian matrices corresponding to A 
in Eqs. © and ©. We solve the Lasso problem in Eq. © 
via coordinate descent. Similar to the Newton coordinate 
descent method, we maintain U := A a S to reuse inter¬ 
mediate results of the large matrix-matrix product. Given 
the Newton direction for A, we update A •<— A + uA a , 
where a is obtained by line search. 

Restricting the generalized Newton descent to A simpli¬ 
fies the computation significantly for coordinate descent 
updates, compared to the previous approach ma that ap¬ 
plies it to both A and © jointly. Our updates only in¬ 
volve V A g(A, 0) and V^p(A, 0), and no longer involve 
V©g(A, 0) and V a V©<?(A, 0), eliminating the need to 
compute the large p x q dense matrix T in 0(npq + nq 2 ) 
time. Our approach also reduces the computational cost for 
the coordinate descent update of each element of A a from 
0(jp + q) to 0{q). 

3.2 Coordinate Descent Optimization for © 

With A fixed, the optimization problem in Eq. © with re¬ 
spect to © becomes 

argmin g A (0) + A©||0||i, (7) 

© 

where g A (0) = tr(2S xy 5 © + A~ 1 0 T S xx 0). Since 
g a( 0) is a quadratic function itself, there is no need to 
form its second-order Taylor expansion or to determine a 
step size via line search. Instead, we solve Eq. 0 directly 
with coordinate descent method, storing and maintaining 
V := ©S. Our approach reduces the computation time for 


updating © compared to the corresponding computation in 
the previous algorithm. We avoid computing the large px q 
matrix I\ which had dominated overall computation time 
with 0(npq). Our approach also eliminates the need for 
line search for updating 0. Finally, it reduces the cost for 
each coordinate descent update in © to 0(p ), compared to 
()(p + q) for the corresponding computation for A© in the 
previous method. 

Our approach is summarized in Algorithm© We provide 
the details of the coordinate descent update equations in 
Appendix. For approximately solving Eqs. © and (jTJ, we 
warm-start A and © from the results of the previous it¬ 
eration and make a single pass over the active set, which 
ensures decrease in the objective in Eq. ©I and reduces the 
overall computation time in practice. 

4 ALTERNATING NEWTON BLOCK 
COORDINATE DESCENT 

The alternating Newton coordinate descent algorithm in 
the previous section improves the computation time of the 
previous state-of-the-art method, but is still limited by the 
space required to store large matrices during coordinate de¬ 
scent computation. Solving Eq. © for updating A requires 
precomputing and storing qxq matrices, S and \E, whereas 
solving Eq. © for updating © requires S and a px p ma¬ 
trix S xx . A naive approach to reduce the memory footprint 
would be to recompute portions of these matrices on de¬ 
mand for each coordinate update, which would be very ex¬ 
pensive. 

In this section we describe how our algorithm in the previ¬ 
ous section can be combined with block coordinate descent 
to scale up the optimization to very large problems on a 
machine with limited memory. During coordinate descent 
optimization, we update blocks of A and 0 so that within 
each block, the computation of the large matrices can be 
cached and re-used, where these blocks are determined au¬ 
tomatically by exploiting the sparse stucture. For A, we ex¬ 
tend the block coordinate descent approach in BigQUIC j4j 
developed for GGMs to take into account the conditioning 
variables in CGGMs. For 0, we describe a new approach 
for block coordinate descent update. Our algorithm can, in 
principle, be applied to problems of any size on a machine 
with limited memory. 

4.1 Blockwise optimization for A 

4.1.1 Block coordinate descent method 

A coordinate-descent update of (A A )y requires the /th and 
jth columns of S and IE. If these columns are in memory, 
they can be reused. Otherwise, it is a cache miss and we 
should compute them on demand. S, : for the /th column 
of S can be obtained by solving linear system AS,; = e. 
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Algorithm 2: Alternating Newton Block Coordinate De¬ 
scent_ 

input : Inputs X eR nxp and outputs Y eR nxq ; 

regularization parameters Aa, A© 
output: Parameters A, © 

Initialize 0 0, A <— I q 

for t = 0,1,... do 

Determine active sets S \, S® 

Partition columns of A into k A blocks 

> Minimize over A 

Initialize Aa t— 0 

for 2 = 1 to k A do 

Compute Xp z , Ucy, and IPcy 

for r = 1 to k A do 
if ;/r then 

Identify columns B zr C C, with active 
elements in A 

Compute X s zr , U b zt , and IP B zr 
Update all active (A A )ij in ( C z ,C r ) 

Update A + A + qAa, where step size a is found 
with line search. 

Partition columns of © into k © blocks 

> Minimize over © 

for r = 1 to fc© do 

Compute Ec r , and initialize V <— ©Xcy 
for row i G {1,... ,p} if /,>(<%,cy)) do 

Compute (S xx ),;j for non-empty rows j in Vc r 
Update all active 0, ; in (i,C r ) 


with conjugate gradient method in 0{m A K ) time, where 
K is the number of conjugate gradient iterations. Then, IP, 
can be obtained from R t Rj;, where R = X©X in 0(nq) 
time. 

In order to reduce cache misses, we perform block coor¬ 
dinate descent, where within each block, the columns of 
X are cached and re-used. Suppose we partition Af = 
{1,..., q} into k A blocks, Ci ,..., Cfc A . We apply this 
partitioning to the rows and columns of Aa to obtain 
/.'a x k A blocks. We perform coordinate-descent updates 
in each block, updating all elements in the active set within 
that block. Let A c r denote a q by \C r \ matrix contain¬ 
ing columns of A that corresponds to the subset C r . In 
order to perform coordinate-descent updates on ( C ri C z ) 
block of Aa, we need Xcy, IPcy, and IPcy. Thus, 
we pick the smallest possible k A such that we can store 
2q/k\ columns of X and 2q/k A columns of tP in mem¬ 
ory. When updating the variables within block (C z ,C r ) 
of Aa, there are no cache misses once £cy, ^cy, X Pcy , 
and SPcr are computed and stored. After updating each 
(A a )« to (Aa)zj + p, we maintain U c z and U c r by 
U a U,;t + pYjjt, U jt <— U jt + pTju, Vf € {C z U Cr}. 



Figure 1: Schematic of block coordinate descent for A. 
The A of size q = 9 is updated for each of the k A blocks 
in turn with k A = 3. Filled elements denote the parameters 
in the active set. The green arrows denote rows of S and IP 
that are computed once and reused while sweeping through 
a row of blocks. The red arrows denote cache misses and 
the corresponding columns of S and IP need to be recom¬ 
puted. 



Figure 2: Schematic of block coordinate descent for ©. 
The © of size p = 8, q = 6 is updated for each of the p x k® 
blocks with k® = 2. Filled elements denote the parameters 
in the active set. Green arrows denote columns of S that 
are computed once and reused while sweeping through the 
column of p blocks. The red arrows denote cache misses 
for (S xx )j. 


To go through all blocks, we update blocks 
(C z ,Ci),...,(C z ,C k ) for each 2 G {1, ...,fc A }. 
Since all of these blocks share S c z and IP c z , we pre¬ 
compute and store them in memory. When updating an 
off-diagonal block (C z , C r ),z / r, we compute Xand 
IPcy- In the worst case, overall X and \P will be computed 
k’A times. 

4.1.2 Reducing computational cost using graph 
clustering 

In typical real-world problems, the graph structure of A 
will exhibit clustering, with an approximately block di¬ 
agonal structure. We exploit this structure by choosing a 
partition {Ci,..., Ck A ] that reduces cache misses. Within 
diagonal blocks (C Z ,C Z )’ s, once Xr.\, and IP c z are com¬ 
puted, there are no cache misses. For off-diagonal blocks 
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(C z , C r )’ s, r ^ z, we have a cache miss only if some vari¬ 
able in {Aij \i £ C z ,j £ C r } lies in the active set. We thus 
minimize the active set in off-diagonal blocks via cluster¬ 
ing, following the strategy for sparse GGM estimation in 
|4l . In the best case, if all parameters in the active set ap¬ 
pear in the diagonal blocks, X and 'T are computed only 
once with no cache misses. We use the METIS (5) graph 
clustering library. Our method for updating A is illustrated 
in Figure [T] 

4.2 Blockwise Optimization for © 

4.2.1 Block coordinate descent method 

The coordinate descent update of ®ij requires (S xx ) t and 
Sj to compute (S xx )fVj, where Vj = ©Xj. If (S xx )i 
and S j are not already in the memory, it is a cache miss. 
Computing (S xx )* takes 0(np), which is expensive if we 
have many cache misses. 

We propose a block coordinate descent approach for solv¬ 
ing Eq. 0 that groups these computations to reduce 
cache misses. Given a partition {1,..., q} into k@ subsets, 
Ci,..., Ck & , we divide © into px k& blocks, where each 
block comprises a portion of a row of ©. We denote each 
block ( i,C r ), where i £ {1,... ,p}. Since updating block 
(i, C r ) requires (S xx )., and X< 7 r , we pick smallest possible 
k(~) such that we can store q/k@ columns of S in memory. 
While performing coordinate descent updates within block 
(i, C r ) of ©, there are no cache misses, once (S xx )and 
Hc r are in memory. After updating each © )? to ®, :) + /;, 
we update Vc r by V it +- V it + £ C r . 

In order to sweep through all blocks, each time we select 
a q £ {1,..., fc®} and update blocks (1, C r ),..., (p, C r ). 
Since all of these p blocks with the same C r share the com¬ 
putation of Sc r , we compute and store Sc r in memory. 
Within each block, the computation of (S xx ), is shared, 
so we pre-compute and store it in memory, before updat¬ 
ing this block. The full matrix of X will be computed once 
while sweeping through the full ©, whereas in the worst 
case S xx will be computed k@ times. 

4.2.2 Reducing computational cost using row-wise 
sparsity 

We further reduce cache misses for (S xx ).j by strategically 
selecting partition Ci,..., Ck @ , based on the observation 
that if the active set is empty in block ( i,C r ), we can 
skip this block and forgo computing (S xx ),. We therefore 
choose a partition where the active set variables are clus¬ 
tered into as few blocks as possible. Formally, we want to 
minimize g |I,>G%cy))|, where I^S^Cr)) is an indi¬ 
cator function that outputs 1 if the active set c r ) within 
block (*, C ' r ) is not empty.We therefore perform graph clus¬ 
tering over the graph G = ( V. E) defined from the active 
set in ©, where V = {L..... r/ } with one node for each col¬ 


umn of ©, and E = {(j. /,') |0 ;J £ S@, ©,/,■ £ S@ for i = 
1 ,,p}, connecting two nodes j and k with an edge if 
both ®ij and ©,/, are in the active set. This edge set corre¬ 
sponds to the non-zero elements of ® 7 ©, so the graph can 
be computed quickly in 0(m@q). 

We also exploit row-wise sparsity in 0 to reduce the cost of 
each cache miss. Every empty row in © corresponds to an 
empty row in V = ©X. Because we only need elements in 
(S xx )j for the dot product (S xx )fVj, we skip computing 
the fcth element of f S xx ), if the fcth row of 0 is all zeros. 
Our strategy for updating © is illustrated in Figure [2] 

Our method is summarized in Algorithm [2] See Appendix 
for analysis of the computational cost. 

4.3 Parallelization 

The most expensive computations in our algorithm are em¬ 
barrassingly parallelizable, allowing for further speedups 
on machines with multiple cores. Throughout the algo¬ 
rithm, we parallelize matrix and vector multiplications. In 
addition, for block-wise A updates, we compute multiple 
columns of X c z and y P c z as well as multiple columns of 
"Sc,, and \kcy for multiple cache misses in parallel, run¬ 
ning multiple conjugate gradient methods in parallel. For 
block-wise © updates, we compute multiple columns of S 
in parallel before sweeping through blocks and perform a 
parallel compuation within each cache miss, computing el¬ 
ements within each S xx , in parallel. 

5 EXPERIMENTS 

We compare the performance of our proposed methods 
with the existing state-of-the-art Newton coordinate de¬ 
scent algorithm, using synthetic and real-world genomic 
datasets. All methods were implemented in C++ with pa¬ 
rameters represented in sparse matrix format. All experi¬ 
ments were run on 2.6GHz Intel Xeon E5 machines with 
8 cores and 104 Gb RAM, running Linux OS. We run the 
Newton coordinate descent and alternating Newton coordi¬ 
nate descent algorithms as a single thread job on a single 
core. For our alternating Newton block coordinate descent 
method, we run it on a single core and with parallelization 
on 8 cores. 

5.1 Synthetic Experiments 

We compare the different methods on two sets of syn¬ 
thetic datasets, one for chain graphs and another for ran¬ 
dom graphs with clustering for A, generated as follows. 
For chain graphs, the true sparse parameters A is set with 
Ki_i ~i = 1 and A t l - 2.25 and the ground truth © is set 
with ® i i = 1. We perform one set of chain graph experi¬ 
ments where the number of inputs p equals the number of 
outputs q , and another set of experiments with an additional 
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Figure 3: Comparison of scalability on chain graphs. We vary p and q, where (a ) p = q and (b) p = 2 q. The Newton 
coordinate descent and alternating Newton coordinate descent methods could not be run beyond the problem sizes shown 
due to memory constraint, (c) Convergence when q = 20,000 and p = 40,000. 



Figure 4: Comparison of scalability on random graphs with clustering, (a) Varying p with q fixed at 10,000. (b) Varying q 
with p fixed at 40,000. (c) Active set size versus time with p = 20,000 and q = 10,000. 


q irrelevant features unconnected to any outputs, so that p 
equals 2 q. For random graphs with clustering, following the 
procedure in 0 for generating a GGM, we set the true A 
to a graph with clusters of nodes of size 250 and with 90% 
of edges connecting randomly-selected nodes within clus¬ 
ters. We set the number of edges so that the average degree 
of each node is 10, with edge weights set to 1. We then 
set the diagonal values so that A is positive definite. To 
set the sparse patterns for ©, we randomly select 100 
input variables as having edges to at least one output and 
distribute total lOq edges among those selected inputs to in¬ 
fluence randomly selected outputs. We set the edge weights 
of 0 to 1. 

Then, we draw samples from the CGGM defined by these 
true A and ©. We generate datasets with n = 100 sam¬ 
ples for the chain graphs and n = 200 samples for ran¬ 
dom graphs with clustering. We choose Aa and A© so 
that the number of estimated edges in A and © is close 
to ground truth. Following the strategy used in GGM esti¬ 
mation 0. we use the minimum-norm subgradient of the 
objective as our stopping criterion: ||grad 5 (A t , ©t)||i < 
0-01(|| A|| ! + || © || )- 

We compare the scalability of the different methods for 
chain graph experiments, as we vary the problem sizes. We 
show the computation times for datasets with p = q in Fig¬ 
ure |3ja) and for datasets with p = 2q with q additional 
irrelevant features in Figure[3jb). For large problems, com¬ 
putation times are not shown for Newton coordinate de¬ 


scent and alternating Newton coordinate descent methods 
because they could not complete the optimization with lim¬ 
ited memory. Also, for large problems, alternating block 
coordinate descent was terminated after 60 hours of com¬ 
putation. We provide more results on varying the sample 
size n in Appendix. In Figure [ijc), using the dataset with 
p = 40,000 and q = 20,000, we plot the suboptimality in 
the objective / — /*, where /* is obtained by running alter¬ 
nating Newton coordinate descent algorithm to numerical 
precision. Our new methods converge substantially faster 
than the previous approach, regardless of desired accuracy 
level. We notice that as expected from the convexity of the 
optimization problem, all algorithms converge to the global 
optimum and find nearly identical parameter estimates. 

In Figure [4j we compare scalability of different methods 
for random graphs with clustering. In Figure |4| a), we vary 
p, while setting q to 10,000. In Figure[4]b), we vary q, fix¬ 
ing p to 40,000. Similar to the results from chain graph, for 
larger problems, Newton coordinate descent and alternat¬ 
ing Newton coordinate descent methods ran out of memory 
and alternating block coordinate descent was terminated af¬ 
ter 60 hours. For all problem sizes, our alternating New¬ 
ton coordinate descent algorithm significantly reduces the 
computation time of the previous method, the Newton coor¬ 
dinate descent algorithm. This gap in the computation time 
increases for larger problems. In Figure |4fc), we compare 
the convergence in sparsity pattern for the different meth¬ 
ods as measured by the active set size, for problem size 
p = 20,000 and q = 10,000. All our methods recover the 
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Table 1: Computation time in hours on genomic dataset. indicates running out of memory. 


p 

q 

l|A*||o 

ll©*llo 

Newton CD 

Alternating Newton CD 

Alternating Newton BCD 

34,249 

3,268 

34,914 

28,848 

22.0 

0.51 

0.24 

34,249 

10,256 

86,090 

103,767 

> 50 

2.4 

2.3 

442,440 

3,268 

26,232 

30,482 

* 

* 

11 



Figure 5: Speedup with parallelization for alternating New¬ 
ton block coordinate descent. 

optimal sparsity pattern much more rapidly than the previ¬ 
ous approach. 

Figures [3] and [4] show that our alternating Newton block 
coordinate descent can run on much larger problems than 
any other methods, while those methods without block 
coordinate descent run out of memory. For example, in 
Figure [4jli) alternating Newton block coordinate descent 
could handle problems with one million inputs, while with¬ 
out block-wise optimization it ran out of memory when 
p > 100,000. We also notice that on a single core, the alter¬ 
nating Newton block coordinate descent is slighly slower 
than the same method without block-wise optimization be¬ 
cause of the need to recompute S and S xx . However, it is 
still substantially faster than the previous method. 

Finally, we evaluate the parallelization scheme for our 
alternating Newton block coordinate descent method on 
multi-core machines. Given a dataset generated from chain 
graph with p = 40,000 and q = 20,000, in Figure [5] we 
show the folds of speedup for different numbers of cores 
with respect to a single core. We obtained about 7 times 
speed up with 8 cores on a machine with 104Gb RAM, 
and about 12 times speedup with 16 cores on a machine 
with 28Gb RAM. In general, we observe greater speedup 
on larger problem sizes and also for random graphs, be¬ 
cause such problems tend to have more cache misses that 
can be computed in parallel. 

5.2 Genomic Data Analysis 

We compare the different methods on a genomic dataset. 
The dataset consists of genotypes for 442,440 single nu¬ 
cleotide polymorphisms (SNPs) and 10,256 gene expres¬ 
sion levels for 171 individuals with asthma, after removing 
genes with variance under 0.01. We fit a sparse CGGM us¬ 
ing SNPs as inputs and expressions as outputs to model 
a gene network influenced by SNPs. We also compared the 



time (hours) 

(b) 

Figure 6: Convergence results on genomic dataset, (a) Sub¬ 
optimality and (b) active set size over time. 

methods on a smaller dataset of 34,249 SNPs from chromo¬ 
some 1 and 3,268 genes with variance > 0.1. As typically 
sparse model structures are of interests in this type of anal¬ 
ysis, we chose regularization parameters so that the number 
of non-zero entries in each of © and A at convergence was 
approximately 10 times the number of genes. 

The computation time of different methods are provided 
in Table |T] On the largest problem, the previous approach 
could not run due to memory constraint, whereas our block 
coordinate descent converged in around 11 hours. We also 
compare the convergence of the different methods on the 
dataset with 34,249 SNPs and 3,268 gene expressions in 
Figure [6] and find that our methods provide vastly superior 
convergence than the previous method. 

6 CONCLUSION 

In this paper, we addressed the problem of large-scale op¬ 
timization for sparse CGGMs. We proposed a new opti¬ 
mization procedure, called alternating Newton coordinate 
descent, that reduces computation time by alternately op¬ 
timizing for the two sets of parameters A and ©. Further, 
we extended this with block-wise optimization so that it 
can run on any machine with limited memory. 
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A Appendix 


A.l Coordinate Descent Updates for Alternating Newton Coordinate Descent Method 


In our alternating Newton coordinate descent algorithm, each element of A a is updated as follows: 

(Aa )ij <-(Aa )ij — CA + S\ A /a A (cA - 

U\ 

where S r (w) = sign(w) max(|tu| — r, 0) is the soft-thresholding operator and 
"a = ^ij + 

= (S yy )ij — Sy — ij + (SAaS)ij + (HIAaS)^ + (\&AAS)ji 
ca —A ij (A A )ij. 


For ©, we perform coordinate-descent updates directly on 0 without forming a second-order approximation of the log- 
likelihood to find a Newton direction, as follows: 


where 


&ij ®ij — c® + S\ & / a& (c@ 


b&_ 

a® 


), 


u© =Sjj(S xx )22 

b® =2(S xy )ij + 2(S xx 0S)ij 

C® —• 



n (samples) 



(a) (b) 

Figure 7: Results from varying sample size n on chain graph with p = q = 10,000. (a) Comparison of computation time 
of different methods, (b) Comparison of edge recovery accuracy as measured by F-\ -score. 


A.2 Additional Results from Synthetic Data Experiments 

We compare the performance of the different algorithms on synthetic datasets with different sample sizes n, using a chain 
graph structure with p = q = 10,000. Figure^a) shows that our methods run significantly faster than the previous method 
across all sample sizes. In Figure [7jb) we measure the accuracy in recovering the true chain graph structure in terms of 
Fi -score for different sample sizes n. At convergence, F-\ -score was the same for all methods to three significant digits. As 
expected, the accuracy improves as the sample size increases. 

















